Introduction

Wine tasting can range from a casual pastime to a lucrative profession. For professional sommeliers, considerable time and training are required to rate wine quality reliably. Intuitively, we expect expert ratings to reflect the underlying chemical composition of the wines. To this end, we hypothesized that it would be possible to predict ratings using statistical models based on the chemical makeup of wines.

Project Aim

To determine how accurately expert wine quality ratings can be predicted using a set of easily measured chemical components.

Data

Two datasets of expert quality ratings of red and white Vinho Verde wines were used. The data were obtained from http://archive.ics.uci.edu/ml/datasets/wine+quality and comprise a total of 1599 red wines and 4898 white wines.

The outcome variable is wine quality, an ordinal variable theoretically ranging from 0 (very poor) to 10 (excellent). However, the observed ratings only range from 3 to 9, and the data are highly unbalanced across quality classes.

Predictor variables are 11 physicochemical wine components: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol.

Below are boxplots showing how each predictor is distributed across quality ratings for the red and white wine datasets.

The plots below show the correlations among the 11 variables for the red and white wine datasets, respectively. The darker the blue, the more positively correlated the variables; the darker the red, the more negatively correlated.

Methods

Data processing

For the analysis, the red and white wine datasets were each split into training and testing sets: 80% of the available data was sampled for training and the remaining 20% was held out for testing, giving separate training and test sets for red and white wines. The training data were used to fit the models and the test data to evaluate their prediction accuracy. Since the quality variable was highly unbalanced in the full dataset, the relative frequencies of the quality classes were preserved in both the training and test data (a stratified split).
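The stratified split described above can be sketched in a few lines of base R (an illustrative sketch with our own function and variable names, not necessarily the code used in the analysis):

```r
# Stratified 80/20 split: sample within each quality class so the
# class frequencies are preserved in both the training and test sets
stratified_split <- function(data, outcome, train_frac = 0.8) {
  idx_by_class <- split(seq_len(nrow(data)), data[[outcome]])
  train_idx <- unlist(lapply(idx_by_class, function(idx)
    sample(idx, floor(length(idx) * train_frac))))
  list(train = data[train_idx, , drop = FALSE],
       test  = data[-train_idx, , drop = FALSE])
}
```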

Modeling Approach

We modeled wine quality ratings using the random forest machine learning method and three likelihood-based models: linear, partial proportional odds, and multinomial regression. The details of these models are given below:

Linear Regression

Since the observed wine ratings were integers, it was unclear whether rating should be considered a continuous or ordinal random variable. As an initial step, we assumed the ratings were continuous and fit an OLS linear regression model as follows:

\[Z = X\beta + \epsilon,\ \text{where}\ \epsilon\sim N(0,\sigma^2 I)\] where \(Z\) is the response variable, quality, \(X\) is the design matrix of predictor variables, and \(\beta\) is the vector of regression coefficients; the distributional assumption on \(\epsilon\) implies the response is normally distributed. The least squares estimate of \(\beta\) is \[\hat\beta = (X^T X)^{-1}X^T Z.\] Since linear-model predictions are continuous but the quality ratings are integers, we rounded the predicted values to the nearest integer to obtain the final predicted quality rating.
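As a concrete sketch of this estimator and the rounding step (our own illustrative helpers, not the report's implementation):

```r
# Least squares estimate: solve (X'X) beta = X'z rather than inverting
# X'X explicitly, then round fitted values to integer ratings
ols_beta <- function(X, z) solve(crossprod(X), crossprod(X, z))
predict_rating <- function(X, beta) round(as.vector(X %*% beta))
```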

Partial Proportional Odds Models

To account for the ordinal nature of wine ratings without assuming continuity, we implemented three cumulative logit models: proportional odds, partial proportional odds, and non-proportional odds. Letting \(Z\) denote the wine rating, \(i\) index a specific wine, and \(x_i\) denote the chemical components of wine \(i\), the models are given below:

  • Non-proportional odds: \(\text{logit}(P(Z \leq j|x_i)) = \alpha_j + x_i^T\beta_j\)
  • Proportional odds: \(\text{logit}(P(Z \leq j|x_i)) = \alpha_j + x_i^T\beta\)
  • Partial proportional odds: \(\text{logit}(P(Z \leq j|x_i)) = \alpha_j + x^T_{i*}\beta + x^T_{i**}\beta_j\)

where \(\alpha_j > \alpha_{j'}\) for \(j > j'\), and \(x_{i*}\) and \(x_{i**}\) are subvectors of \(x_i\) (the covariates whose effects are, respectively, constant and varying across rating levels).
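Under any of these models, the fitted category probabilities are differences of adjacent cumulative probabilities, \(P(Z = j) = P(Z \leq j) - P(Z \leq j-1)\). A minimal sketch of this step for a proportional odds model (illustrative only, not the sommelieR implementation):

```r
# Category probabilities implied by a proportional odds model:
# alpha is the increasing vector of cutpoints, eta = x' beta
prop_odds_probs <- function(alpha, eta) {
  cum <- c(plogis(alpha + eta), 1)  # cumulative P(Z <= j); last category has cum = 1
  diff(c(0, cum))                   # successive differences give P(Z = j)
}
```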

The log likelihood for each model was maximized using the BFGS algorithm as implemented in the R package optimx (alternatively, users of our R package could specify any available method implemented in optimx). For models that converged, the predicted classification was taken to be the rating with the highest predicted probability. An S3 predict method was implemented in sommelieR to facilitate these predictions.

Multinomial Regression

We were also interested in whether we could relax the ordinality assumption and still retain predictive accuracy, so we fit a multinomial regression model as well. Consider a random variable \(Y\) taking values in \(1,\dots,K\), with covariates \(X\). The multinomial regression model has the form

\[\log \frac{P(Y = k | X = x)}{P(Y = K | X = x)} = X\beta_k,\ \ k = 1,...,(K-1)\]

where \(\beta_k\) is a \(p \times 1\) vector of regression coefficients corresponding to level \(k\). Additionally, we apply the constraint that the probabilities sum to one, i.e. \(\sum_{i = 1}^K P(Y = i | X = x) = 1\). To get the probabilities used for prediction, we solve the above system of equations under this constraint:

\[P(Y = k | X = x) = \frac{\exp(X\beta_k)}{1 + \sum_{l = 1}^{K - 1}\exp(X\beta_l)},k = 1,..., K-1 \\ P(Y = K | X = x) = \frac{1}{1 + \sum_{l = 1}^{K - 1}\exp(X\beta_l)}\]
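These probabilities are a softmax with class \(K\) as the reference (its linear predictor is fixed at zero). A minimal sketch of this mapping (our own hypothetical helper, not the package code):

```r
# Convert linear predictors for classes 1..K-1 into the K class
# probabilities above; the reference class K gets linear predictor 0
multinom_probs <- function(eta) {
  e <- exp(c(eta, 0))  # append reference class K
  e / sum(e)
}
```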

This notation was taken from Elements of Statistical Learning, 2009 (Hastie, Tibshirani, and Friedman). For both red and white wine we fit three multinomial regression models:

  • All Predictors with Linear Terms
  • Reduced Model with Linear Terms
  • Reduced Model with All Second Order Terms

where the coefficients in the reduced models, and how they were selected, are described below. While we would have liked to fit models with higher order terms (quadratic terms, interactions), this was impractical with our current implementation of multinomial regression. For a response taking on K classes, each new variable added to the model adds (K-1) coefficients, which made fitting a full second order model challenging without binning the response variable. In the future, this challenge could be overcome by reimplementing the likelihood in C++ or by penalizing each row of the coefficient matrix (since one cannot simply penalize each coefficient individually).

Random Forest

  • An ensemble of decision trees, each built on a bootstrap sample of the observations
  • The final prediction is the majority vote across the ensemble of trees
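The two bullets above can be illustrated with a toy sketch (not the random forest implementation used for this report, which builds full decision trees):

```r
# Bootstrap resampling of observations, and a majority vote over the
# class labels predicted by the members of an ensemble
bootstrap_sample <- function(data) {
  data[sample(nrow(data), replace = TRUE), , drop = FALSE]
}
majority_vote <- function(votes) {
  names(which.max(table(votes)))  # ties broken by first class
}
```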

Variable Selection

For the random forest models, we used all 11 physicochemical variables as predictors. For the likelihood-based models, we considered full and reduced models. The full models for both red and white wines included all 11 physicochemical variables. The variables in the reduced models were determined by examining the correlations between predictors and using best subsets based on OLS regression. The covariates selected for the red model were volatile acidity, total sulfur dioxide, pH, alcohol, and sulphates. The covariates selected for the white model were pH, volatile acidity, residual sugar, and alcohol.

Model Evaluation

We compared the four modeling approaches on overall accuracy (correct predictions/total), kappa, and weighted kappa. Kappa measures how much better the classification is than the agreement expected by chance given the marginal class frequencies. Weighted kappa is an extension of kappa that is more useful for data with inherent ordering since it penalizes misclassifications in proportion to their distance from the true category. For instance, when the true quality rating is 4, a prediction of 7 is penalized more severely than a prediction of 5. Since our outcome variable, quality, is ordered, we used weighted kappa to select the final model.
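Weighted kappa can be computed directly from a confusion matrix. A sketch with distance-based weights (illustrative; the actual analysis may use a packaged implementation with a different weighting scheme):

```r
# Weighted kappa: 1 - sum(w * observed) / sum(w * expected), where the
# weight for cell (i, j) grows with the distance |i - j| between the
# predicted and true categories
weighted_kappa <- function(conf, quadratic = FALSE) {
  k <- nrow(conf)
  d <- abs(outer(seq_len(k), seq_len(k), "-"))
  w <- if (quadratic) d^2 else d
  obs <- conf / sum(conf)                    # observed proportions
  expd <- outer(rowSums(obs), colSums(obs))  # expected under independence
  1 - sum(w * obs) / sum(w * expd)
}
```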

Results: Linear Model

The predictive performance of the full and reduced linear regression models on the testing data are given below:

Overall results (accuracy, kappa, weighted kappa) and percent correct by quality category (3-9):

Model            Accuracy   Kappa  Weighted Kappa      3   4      5      6      7   8    9
Full (Red)          57.73  0.2998          0.4996      0   0  67.65  64.57  23.08   0   NA
Reduced (Red)       56.15  0.2728          0.4545      0   0  65.44  63.78  20.51   0   NA
Full (White)        52.61  0.2162          0.4211      0   0  39.86  81.78  22.16   0    0
Reduced (White)     51.02  0.1975          0.3943      0   0  43.99  77.22  18.18   0    0

Below is a visualization of the confusion matrix for red wine. The linear model tends to pull predictions toward the mean rating: the model captures the population mean, and the data are highly unbalanced, with most wines rated 5 or 6.

This pull toward the mean is even more pronounced for white wine, where most predictions are 6.

Results: Partial Proportional Odds Model (White Wine)

For models fit on the white wine training data, only the proportional odds models converged to parameter estimates that produced non-negative fitted probabilities for every subject and quality rating. The code for fitting these models is given below (note that the model objects are saved to disk to shorten the compile time for this report):

#get starting values for the model betas
beta.starts <- coef(lm(quality ~ alcohol + pH + volatile.acidity + residual.sugar,
                       data = white_train))

#proportional odds, reduced model
white.prop.odds.reduced <- partial.prop.odds.mod(y = "quality", in.data = white_train,
                              prop.odds.formula = ~ alcohol + pH + volatile.acidity + residual.sugar,
                              beta.prop.odds.start = beta.starts[2:5],
                              method = "BFGS")
saveRDS(white.prop.odds.reduced, "proportional_odds_models/white.prop.odds.reduced.rds")


#proportional odds, full model
beta.starts <- coef(lm(quality ~ alcohol + pH + volatile.acidity + residual.sugar +
                         fixed.acidity + citric.acid + chlorides + free.sulfur.dioxide +
                         total.sulfur.dioxide + density + sulphates,
                       data = white_train))

white.prop.odds <- partial.prop.odds.mod(y = "quality", in.data = white_train,
                              prop.odds.formula = ~ alcohol + pH + volatile.acidity + residual.sugar +
                                fixed.acidity + citric.acid + chlorides + free.sulfur.dioxide +
                                total.sulfur.dioxide + density + sulphates,
                              beta.prop.odds.start = beta.starts[2:12],
                              method = "BFGS")

saveRDS(white.prop.odds, "proportional_odds_models/white.prop.odds.rds")

Then using the predict method implemented for this class, we predicted the ratings in the testing data as follows:

#load saved models 
white.prop.odds.reduced <- readRDS("proportional_odds_models/white.prop.odds.reduced.rds")
white.prop.odds <- readRDS("proportional_odds_models/white.prop.odds.rds")

#predict values 
white.preds <- predict(white.prop.odds, white_test)$most.likely
white.preds.reduced <- predict(white.prop.odds.reduced, white_test)$most.likely

The performance of the full and reduced models is given below. Notably, the full model performed slightly better than the reduced model.

Overall results and percent correct by quality category (3-9):

Prediction                Accuracy   Kappa  Weighted Kappa      3   4        5        6        7   8   9
Proportional Odds (F)      52.1472  0.2126          0.4053      0   0  45.3608  78.1321  19.8864   0   0
Proportional Odds (R)      51.7382  0.2108          0.3993      0   0  51.2027  74.9431  15.9091   0   0

Likewise, the confusion matrix for the full model, using the plotting functionality implemented in our package, is given below:

#plotting confusion matrix
cmPlot(white.pred.table, "white", pred_first = T,
       title = "Proportional Odds Model Predicted vs Observed",
       axis.title.size = 20, axis.text.size = 25)

Results: Partial Proportional Odds Model (Red Wine)

For models fit to the red training data, both proportional and partial proportional models converged. The results presented for the partial proportional model allow the coefficient for total sulfur dioxide to vary with the level of wine quality. The code for producing these models is given below:

#reduced red wine models
beta.starts <- coef(lm(quality ~ alcohol + pH + volatile.acidity + sulphates + total.sulfur.dioxide,
                       data = red_train))

red.partial.prop.reduced <- partial.prop.odds.mod(y = "quality", in.data = red_train,
                              prop.odds.formula = ~ alcohol + pH + volatile.acidity + sulphates,
                              beta.prop.odds.start = beta.starts[2:5],
                              non.prop.odds.formula = ~ total.sulfur.dioxide,
                              beta.non.prop.odds.start = matrix(rep(beta.starts[6], 5), nrow = 1),
                              method = "BFGS")

red.prop.odds.reduced <- partial.prop.odds.mod(y = "quality", in.data = red_train,
                              prop.odds.formula = ~ alcohol + pH + volatile.acidity + sulphates +
                                total.sulfur.dioxide,
                              beta.prop.odds.start = beta.starts[2:6],
                              method = "BFGS")
saveRDS(red.partial.prop.reduced, "proportional_odds_models/red.partial.prop.reduced.rds")
saveRDS(red.prop.odds.reduced, "proportional_odds_models/red.prop.odds.reduced.rds")


#full models
beta.starts <- coef(lm(quality ~ alcohol + pH + volatile.acidity + residual.sugar + fixed.acidity +
                         citric.acid + chlorides + free.sulfur.dioxide + total.sulfur.dioxide +
                         density + sulphates,
                       data = red_train))

#partial proportional odds model
red.partial.prop <- partial.prop.odds.mod(y = "quality", in.data = red_train,
                              prop.odds.formula = ~ alcohol + pH + volatile.acidity + residual.sugar +
                                fixed.acidity + citric.acid + chlorides + free.sulfur.dioxide +
                                density + sulphates,
                              beta.prop.odds.start = beta.starts[c(2:9, 11:12)],
                              non.prop.odds.formula = ~ total.sulfur.dioxide,
                              beta.non.prop.odds.start = matrix(rep(beta.starts[10], 5), nrow = 1),
                              method = "BFGS")

#proportional odds model
red.prop.odds <- partial.prop.odds.mod(y = "quality", in.data = red_train,
                              prop.odds.formula = ~ alcohol + pH + volatile.acidity + residual.sugar +
                                fixed.acidity + citric.acid + chlorides + free.sulfur.dioxide +
                                total.sulfur.dioxide + density + sulphates,
                              beta.prop.odds.start = beta.starts[2:12],
                              method = "BFGS")

saveRDS(red.partial.prop, "proportional_odds_models/red.partial.prop.rds")
saveRDS(red.prop.odds, "proportional_odds_models/red.prop.odds.rds")

Predictions on the testing data were again made using our S3 predict method (code as shown for white wines). The performance of the trained models on the testing data are given below. Briefly, the proportional and partial proportional odds models performed similarly, and the full models performed very slightly better than the reduced models.

Overall results and percent correct by quality category (3-8):

Prediction                      Accuracy   Kappa  Weighted Kappa      3   4        5        6        7   8
Proportional Odds (F)            58.9905  0.3232          0.5258      0   0  72.0588  59.8425  33.3333   0
Proportional Odds (R)            58.0442  0.3065          0.4707      0   0  72.0588  59.8425  25.6410   0
Partial Proportional Odds (F)    58.9905  0.3233          0.5162      0   0  72.0588  60.6299  30.7692   0
Partial Proportional Odds (R)    58.3596  0.3117          0.4742      0   0  72.0588  59.8425  28.2051   0

As with the white wines, the confusion matrices for the full proportional and partial proportional red models are shown below:

Results: Multinomial Regression (Red Wine Quality Classification)

Here we fit the multinomial regression models using the fit_multinomial_regression function from the sommelieR package. This chunk is not evaluated, to shorten compile time for this report, though it would be evaluated in a report more concerned with reproducibility.

#Begin multinomial results section

set.seed(13)

#Fit full model for red wine
red_train_full_fit <- fit_multinomial_regression(
  red_train, quality ~ 1 + .,
  ref_level = "8", trace = 1, itters = 300)

#Fit reduced linear model for red wine
red_train_reduced_linear_fit <- fit_multinomial_regression(
  red_train,
  quality ~ 1 + alcohol + volatile.acidity + total.sulfur.dioxide + pH + sulphates,
  ref_level = "8", trace = 1, itters = 300)

#Fit reduced quadratic model for red wine
red_train_reduced_all_quad_fit <- fit_multinomial_regression(
  red_train,
  quality ~ 1 + alcohol + volatile.acidity + total.sulfur.dioxide + pH +
    I(alcohol*alcohol) + I(volatile.acidity*volatile.acidity) + I(pH*pH) +
    I(total.sulfur.dioxide*total.sulfur.dioxide) + I(sulphates*sulphates) +
    (alcohol + volatile.acidity + total.sulfur.dioxide + pH + sulphates)^2,
  ref_level = "8", trace = 1, itters = 300)

#Fit full linear model for white wine
white_train_full_fit <- fit_multinomial_regression(
  white_train, quality ~ 1 + .,
  ref_level = "8", trace = 1, itters = 300)

#Fit reduced linear model for white wine
white_train_linear_fit <- fit_multinomial_regression(
  white_train,
  quality ~ 1 + pH + volatile.acidity + residual.sugar + alcohol,
  ref_level = "8", trace = 1, itters = 300)

#Fit reduced model with quadratic terms for white wine (takes a long time to run)
white_train_quad_fit <- fit_multinomial_regression(
  white_train,
  quality ~ 1 + pH + volatile.acidity + residual.sugar + alcohol +
    (pH + volatile.acidity + residual.sugar + alcohol)^2 +
    I(pH*pH) + I(volatile.acidity*volatile.acidity) +
    I(residual.sugar*residual.sugar) + I(alcohol*alcohol),
  ref_level = "8", trace = 1, itters = 300)

#Save models because they take a while to run
saveRDS(red_train_full_fit, "multinomial_models/red_full_fit.rds")
saveRDS(red_train_reduced_linear_fit, "multinomial_models/red_reduced_linear_fit.rds")
saveRDS(red_train_reduced_all_quad_fit, "multinomial_models/red_reduced_all_quad_fit.rds")
saveRDS(white_train_full_fit, "multinomial_models/white_full_fit.rds")
saveRDS(white_train_linear_fit, "multinomial_models/white_reduced_linear_fit.rds")
saveRDS(white_train_quad_fit, "multinomial_models/white_reduced_quad_fit.rds")

Here we generate predictions on the test set using the predict_multinomial function from the sommelieR package.

#Fit predictions
red_test_full_preds <- predict_multinomial(red_train_full_fit, red_test)
red_test_reduced_linear_preds <- predict_multinomial(red_train_reduced_linear_fit, red_test)
red_test_reduced_quad_preds <- predict_multinomial(red_train_reduced_all_quad_fit, red_test)
white_test_full_preds <- predict_multinomial(white_train_full_fit, white_test)
white_test_linear_preds <- predict_multinomial(white_train_linear_fit,
                                              white_test)
white_test_quad_preds <- predict_multinomial(white_train_quad_fit, white_test)

And here we use a convenient helper function, pred_summary_stats, from the sommelieR package (soon to be on CRAN) to assemble our final table.

multinomial_red_results <- bind_rows(
  pred_summary_stats(red_test_full_preds, red_test$quality, "Full Model (Linear Terms)"),
  pred_summary_stats(red_test_reduced_linear_preds, red_test$quality, "Reduced Model (Linear Terms)"),
   pred_summary_stats(red_test_reduced_quad_preds, red_test$quality, "Reduced Model (Second Order Terms)")
) %>% dplyr::arrange(desc(`Weighted Kappa`)) 
colnames(multinomial_red_results)[2] <- "Accuracy"

multinomial_red_results %>% 
  kable(caption = "Comparison of Multinomial Regression Models for Red Wine Quality", 
        booktabs = T, digits = 4) %>%
        add_header_above(c(" ", "Overall Results" = 3, "Percent Correct by Category" = 6)) %>%
        kable_styling(latex_options = c("repeat_header", "scale_down"))

#Remove "model" because it's not in the big table. 
multinomial_best_red_model_results <- multinomial_red_results %>% slice(1) %>% 
  select(-Model)

We see from the above table that the full model with linear predictors has the best weighted kappa of all considered multinomial regression models for red wine. It is our best model and will represent multinomial regression in the final comparison. Its confusion matrix is visualized below:

Results: Multinomial Regression (White Wine Quality Classification)

The same functions were again used to construct the table:

multinomial_white_results <- bind_rows(
  pred_summary_stats(white_test_full_preds, white_test$quality, "Full Model (Linear Terms)"),
  pred_summary_stats(white_test_linear_preds, white_test$quality, "Reduced Model (Linear Terms)"),
   pred_summary_stats(white_test_quad_preds, white_test$quality, "Reduced Model (Second Order Terms)"))%>% 
  dplyr::arrange(desc(`Weighted Kappa`))
colnames(multinomial_white_results)[2] <- "Accuracy"

multinomial_white_results %>% 
  kable(caption = "Comparison of Multinomial Regression Models for White Wine Quality", 
        booktabs = T, digits = 4) %>%
        add_header_above(c(" ", "Overall Results" = 3, "Percent Correct by Category" = 7)) %>%
        kable_styling(latex_options = c("repeat_header", "scale_down"))
Comparison of Multinomial Regression Models for White Wine Quality (overall results and percent correct by quality category 3-9):

Model                                 Accuracy   Kappa  Weighted Kappa      3      4        5        6        7   8   9
Full Model (Linear Terms)              54.0900  0.2451          0.4121      0  9.375  51.2027  79.4989  15.9091   0   0
Reduced Model (Linear Terms)           51.9427  0.2201          0.4093      0  3.125  54.6392  72.6651  16.4773   0   0
Reduced Model (Second Order Terms)     53.2720  0.2396          0.4010      0  3.125  54.2955  74.4875  19.8864   0   0

multinomial_white_best_model_results <- multinomial_white_results %>%
  slice(1) %>% 
  select(-Model)

We see from the above table that the full model with linear predictors has the best weighted kappa of all considered multinomial regression models for white wine. It is our best model and will represent multinomial regression in the final comparison for white wines. Its confusion matrix is visualized below:

Results: Multinomial Regression - Full Model Confusion Matrix (White Wine)

Results: Random Forest

Random Forest Results for Red and White Wine (overall results and percent correct by quality category 3-9):

Prediction    Accuracy   Kappa  Weighted Kappa      3   4      5      6      7      8   9
Red Wine         70.98  0.5263          0.6168      0   0  83.82  70.87  53.85   0.00  NA
White Wine       67.28  0.4862          0.6542      0  25  67.35  80.87  47.73  42.86   0

Results: Random Forest (Variable Importance)

Results: Random Forest (Red Wine)

Results: Random Forest (White Wine)

Discussion

Comparisons of the performance of the different modeling approaches are given below:

Comparison of Results for Red Wine

Overall results and percent correct by quality category (3-8):

Model                        Accuracy   Kappa  Weighted Kappa      3   4        5        6        7   8
Random Forest                 70.9800  0.5263          0.6168      0   0  83.8200  70.8700  53.8500   0
Proportional Odds             58.9905  0.3232          0.5258      0   0  72.0588  59.8425  33.3333   0
Multinomial                   58.3596  0.3158          0.5227      0  20  74.2647  55.9055  28.2051   0
Partial Proportional Odds     58.9905  0.3233          0.5162      0   0  72.0588  60.6299  30.7692   0
Linear Regression             57.7300  0.2998          0.4996      0   0  67.6500  64.5700  23.0800   0

Comparison of Results for White Wine

Overall results and percent correct by quality category (3-9):

Model                Accuracy   Kappa  Weighted Kappa      3       4        5        6        7      8   9
Random Forest         67.2800  0.4862          0.6542      0  25.000  67.3500  80.8700  47.7300  42.86   0
Linear Regression     52.6100  0.2162          0.4211      0   0.000  39.8600  81.7800  22.1600   0.00   0
Multinomial           54.0900  0.2451          0.4121      0   9.375  51.2027  79.4989  15.9091   0.00   0
Proportional Odds     52.1472  0.2126          0.4053      0   0.000  45.3608  78.1321  19.8864   0.00   0

Using random forests, we predicted expert wine quality ratings fairly accurately. For white wines, the overall accuracy was 67.28%, with a weighted kappa of 0.6542. For red wines, the accuracy was 70.98% with weighted kappa 0.6168. Using the proposed cutoffs from Landis and Koch, these weighted kappa values suggest moderate to substantial agreement with the expert ratings.

The likelihood-based approaches performed worse: accuracies were 10-15 percentage points lower than random forest, and weighted kappas were 0.1 to 0.2 lower. Moreover, among the likelihood-based models, predictive performance was similar and no single approach stood out as superior. Generally, these models tended to predict ratings toward the middle of the distribution, and, interestingly, accounting for the ordered nature of the ratings did not make a substantial difference in performance. Indeed, the multinomial model was more accurate than the proportional odds and linear regression models for very low and very high quality wines. Overall, the weighted kappas for the likelihood-based models suggested moderate agreement with expert ratings.

Our project has limitations. We only had data on a small number of chemical components, so we speculate that some misclassification was due to omitting other important chemicals from our models. Likewise, our datasets included only small numbers of extremely poor and excellent wines, making those categories difficult to predict. In the future, it would be worthwhile to examine whether outlier detection algorithms are better suited to characterizing these wines than the approaches we took.

In terms of variable selection for the likelihood-based models, we saw only a minor improvement in predictive performance for the full compared to the reduced models, suggesting we may have overfit the data. A regularization method like the LASSO or elastic net with cross-validation may have been useful. However, it is somewhat unclear how best to implement penalization for the partial proportional odds and multinomial regression models, since each coefficient for a given class and variable is related to the other coefficients for that class; some sort of grouped variable selection method would need to be implemented for these models. Additionally, our modeling focus was purely on prediction, not on characterizing how individual chemicals relate to wine quality. An interesting extension of this project would investigate those relationships more thoroughly.

In summary, we found that wine quality ratings can be predicted reasonably well from their chemical components. Random forests performed better than likelihood-based models, but all methods demonstrated at least moderate agreement with expert ratings. Nevertheless, the predictions were far from perfect, so true wine connoisseurs are still better off consulting a sommelier.

References

Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977 Mar;33(1):159-74.

Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The elements of statistical learning: data mining, inference, and prediction. 2nd ed. New York: Springer.